The purpose of this pipeline is to provide additional analysis to complement the analysis by Metabolon, where they identified biochemicals (metabolites) that differed significantly across experimental groups. In the Metabolon data, each “sub-pathway” classifies a collection of metabolites, and each “super-pathway” classifies a collection of sub-pathways. Hypothesis testing at the sub-pathway level can provide valuable insight into how the experimental conditions affect metabolic profiles. This pipeline walks through how to perform hypothesis testing on the metabolomics data at the sub-pathway level and provides useful tables and figures to aid our understanding. The below figure highlights the steps we will be taking, which highlight the main sections of this pipeline by:
Normalization and Standardization
Data Exploration
Subpathway Analysis
Pairwise comparisons
Plots
img = image_read("../Workflow.png")
image_trim(img)
The R “chunks” separate each of the analysis steps within the main sections of the pipeline. In some chunks, there may need to be lines of code that need to change from experiment to experiment. This pipeline provides a list of steps for each chunk and highlights places where you need to make changes.
To get started, in the Metabolomics pipeline working directory, there are five folders with the following folder structure:
You should save the .xlsx file from Metabolon and any additional metadata files to the Data/Metabolon folder. The main output folder will contain the plots and tables generated from this pipeline. Additionally, you can run this pipeline as a report and save it as a .pdf or .html. The report saves to the “Outputs/Reports” folder.
We will use the Sample Meta Data, Peak Area Data, and the chemical annotation provided in the Data Tables Excel file from Metabolon for the analysis. We can also add Sample Meta Data not captured in the Metabolon Excel file. Therefore, we will need links to the following files
If there are no additional variables, then you can set the additional_vars value to NULL.
# 1. Metabolome excel file
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"
# 2. Path to additional variables. Please make this NULL if there are not any additional
# Files
additional_vars = "../Data/Metabolon/AdditionalVars.xlsx"
Once you provide a path to the data, we can load the data into R. This chunk will access the Metabolon Excel file specified above and pull out three data sets within the Excel file. Below is a table that shows the data that we pull from the Excel file and the tab within the Metabolon Excel sheet.
| Data Accessed | Tab |
|---|---|
| Raw Peak | Peak Area Data |
| Meta Data | Sample Meta Data |
| Chemical Annotation | Chemical Annotation |
If the names of the tabs containing this data are different in your Excel file, there are two solutions. You can change the name of the tabs to match those listed above, or you can replace the “sheet” variable in the “read.xlsx” function to match the naming convention within your Metabomon Excell workbook.
In the following chunk we are:
Loading the peak data from the Metabolon .xlsx file.
Load the meta data from the Metabolon .xlsx file. If any additional variables are provided in the links above, these variables will be added to the meta data.
Loading in the chemical annotation data.
# 1. load peak data Peak area data
peak_data <- read.xlsx(met_excel,
sheet = "Peak Area Data")
# 2. Load metadata with no additional variables
if(is.null(additional_vars)){
meta_data <- read.xlsx(met_excel,
sheet = "Sample Meta Data")
}
# Load meta_data with additional variables
if(!is.null(additional_vars)){
meta_data <- read.xlsx(met_excel,
sheet = "Sample Meta Data")
additional <- read.xlsx(additional_vars)
meta_data <- merge(additional
,meta_data, by="PARENT_SAMPLE_NAME")
}
# 3. Chemical names key
chem_data <- read.xlsx(met_excel,
sheet = "Chemical Annotation")
After reading the data, we want to inspect the loaded data to ensure everything is stored properly. In the following chunk we will view the first 6 rows of the metadata, chemical annotation data, and the first 6 row and ten columns of the peak data. Therefore, the steps we are taking are,
First 6 rows of metadata
First 6 rows of the chemical annotation data
First 6 rows and 10 columns of the peak data.
# 1. First 6 rows of meta data
head(meta_data)
# 2. First 6 rows of the chemical annotation data
head(chem_data)
# First 6 rows and 10 columns of the peak data.
head(peak_data[,1:10])
The goal of this section is to design a table that resembles the study design of your experiment. This step allows us to see descriptive statistics of the samples in our data. For this, we are using the metadata to define a table with a similar structure as the table below.
| Strat 1 | Strat 2 | |
|---|---|---|
| Var1 | XX | XX |
| Var 2 | XX | XX |
| Var 3 | XX | XX |
Alternatively, we can define a table with only the number of observations within each experimental group. The following code will create the desired table 1 for both scenarios.
In this chunk, we will label the metadata’s variable names. Labeling the variable names will allow the tables to display your chosen variable names while maintaining the original variable name within the data. We do this in two steps.
Change the name assigned to var1 to match the name of the variable in the data set which you want to include in the table
Repeat this for as many variables as needed.
Choose a single variable to stratify the table by.
Assign the variable a label name
# 1. Choose the meta data variable name.
var1 = "GROUP_NAME"
var2 = "TIME1"
# 2. Choose a single variable to stratify the table by.
stratified_var = "Gender"
# 3. Assign the variable a label name
label(meta_data[,var1]) <- "Treatment Group"
label(meta_data[,var2]) <- "Time"
# 4. Assign a stratified varibale a label name for the table.
label(meta_data[,stratified_var]) = "Gender"
Now that you have properly assigned the variables, we will create table 1. To create the table, we will be using the table1 function. You must change the arguments depending on what you want the table to display. For example: * Table 1 without interaction:
\[ table1(\sim \underbrace{var_1 + var_2 ...}_{\text{Variable names for rows}}, \ \text{data = meta_data} ) \]
\[ table1(\sim \underbrace{var_1 + var_2 ...}_{\text{Variable names for rows}} \underbrace{| \ stratified\_var}_{\text{Statified Variable}}, \ \text{data = meta_data} ) \] Above is how you change the table1 function to display the table with and without a stratified variable. In the next chunk, we utilize the following steps.
Create table 1. In this step you will need to use the same variable names as in the above chunk within the table1 function.
Display the table
Save the table in the folder “/Outputs/Tables/” with the file name “table1”.
# 1. Creates table1
tbl1 <- table1(~ TIME1 + GROUP_NAME| Gender
, data = meta_data)
# 2. Displays table 1
tbl1
| Female (N=44) |
Male (N=42) |
Overall (N=86) |
|
|---|---|---|---|
| Time | |||
| End | 14 (31.8%) | 15 (35.7%) | 29 (33.7%) |
| Onset | 15 (34.1%) | 14 (33.3%) | 29 (33.7%) |
| PreSymp | 15 (34.1%) | 13 (31.0%) | 28 (32.6%) |
| Treatment Group | |||
| 1902 | 15 (34.1%) | 14 (33.3%) | 29 (33.7%) |
| Control | 14 (31.8%) | 14 (33.3%) | 28 (32.6%) |
| WT | 15 (34.1%) | 14 (33.3%) | 29 (33.7%) |
# 3. saves table 1
t1flex(tbl1) %>%
save_as_docx(path = paste0("../Outputs/Tables/","table1.docx"))
In its raw form, the peak data contain counts for each metabolite in each sample. Standardization and normalization are important steps to improve the signal-to-noise ratio. In the next section, we are implementing the same standardization and normalization methods used by Metabolon.
As part of the Metabolon workflow, the raw peak data goes through a series of processing steps to prepare the data for analysis. The first part of this process involves handling the missing peak data in two stages.
1.) Remove chemicals with more than 50% of the observations missing.
2.) For the remaining chemicals with missing data, impute the missing values with 1/5 of the minimum positive value.
For the first step, we will summarize the missingness for the metabolites in a table to show us which metabolites will exclude in the analysis. In the following chunk we:
Create a data set with the number of missing observations for each metabolite. As part of this step, we are also filtering so that only metabolites are included that have more than 50% of the observations missing.
Filter the columns of the peak data to exclude the metabolites identified in step
Create a table that shows the metabolites excluded with the proportion of missing observations for each metabolite.
Display the table of excluded metabolites
Save table in the “Outputs/Tables” folder with the filename “MissingChemicals”
# 1. Create a dataset with the number of missing observations for each metabolite.
missing_variables <- peak_data %>%
mutate(across(everything(), ~ifelse(is.na(.), 1, 0))) %>%
summarise_all(sum) %>%
pivot_longer(cols = everything(), names_to = "CHEM_ID", values_to = "number_missing") %>%
left_join(select(chem_data, CHEM_ID, CHEMICAL_NAME) %>%
mutate(CHEM_ID = as.character(CHEM_ID)),
by = "CHEM_ID") %>%
arrange(number_missing) %>%
filter(number_missing > 0.5*(nrow(peak_data)))
# 2. Filter the columns of the peak data to exclude the metabolites identified in step 1.
peak_data_filter <- peak_data %>%
select(-missing_variables$CHEM_ID)
# 3. Create a table that shows the metabolites excluded
miss_tabl<- missing_variables %>%
select(CHEMICAL_NAME, number_missing) %>%
mutate( prop_missing = round(number_missing/nrow(peak_data), digits = 3)) %>%
flextable() %>%
set_header_labels(CHEMICAL_NAME="Chemical Name", number_missing="Number Missing", number_missing = "Proportion Missing") %>%
set_table_properties(layout = "autofit") %>%
theme_vanilla()
# 4. Display Table
miss_tabl
Chemical Name | Number Missing | prop_missing |
|---|---|---|
3-methoxycatechol sulfate (1) | 44 | 0.512 |
X-24625 | 44 | 0.512 |
N-stearoyl-sphinganine (d18:0/18:0)* | 45 | 0.523 |
ethyl glucuronide | 45 | 0.523 |
N-acetylmethionine sulfoxide | 45 | 0.523 |
cis-3,4-methyleneheptanoate | 45 | 0.523 |
X-24637 | 45 | 0.523 |
N6-succinyladenosine | 46 | 0.535 |
2,2'-methylenebis(6-tert-butyl-p-cresol) | 46 | 0.535 |
indoleacetoylcarnitine* | 46 | 0.535 |
X-24970 | 46 | 0.535 |
palmitoyl-sphingosine-phosphoethanolamine (d18:1/16:0) | 47 | 0.547 |
X-18922 | 47 | 0.547 |
cysteine sulfinic acid | 48 | 0.558 |
palmitoleoyl-linoleoyl-glycerol (16:1/18:2) [1]* | 48 | 0.558 |
phenylacetylglutamine | 50 | 0.581 |
X-25396 | 50 | 0.581 |
maltol sulfate | 51 | 0.593 |
nervonoylcarnitine (C24:1)* | 53 | 0.616 |
phenylacetate | 55 | 0.640 |
4-acetamidobenzoate | 55 | 0.640 |
X-12097 | 56 | 0.651 |
guanosine-3',5'-cyclic monophosphate (cGMP) | 57 | 0.663 |
stearoyl-arachidonoyl-glycerol (18:0/20:4) [1]* | 58 | 0.674 |
glycitein glucuronide (2)* | 58 | 0.674 |
X-11648 | 63 | 0.733 |
glyco-beta-muricholate** | 64 | 0.744 |
bilirubin (Z,Z) | 67 | 0.779 |
X-24548 | 75 | 0.872 |
X-24748 | 78 | 0.907 |
valerylcarnitine (C5) | 84 | 0.977 |
# 5. Save table
save_as_docx(miss_tabl,path = "../Outputs/Tables/MissingChemicals.docx")
Once we remove the metabolites with more than 50% of the observations missing, we impute 1/5 of the minimum value for the remainder with missing values. We do this in the following steps.
Initialize the new peak_data_imputed matrix
Find the minimum value for each metabolite and compute 1/5 of that value
Create a for loop such that if a metabolite has a missing observation, then 1/5 of the minimum value for that metabolite is imputed.
# 1. Initialize the new peak_data_imputed matrix
peak_data_imputed <- peak_data_filter
# 2. Find the minimum value for each metabolite and compute 1/5 of that value
peak_data_mins <- peak_data_imputed %>%
select(-PARENT_SAMPLE_NAME) %>%
summarise_all(min, na.rm = T) %>%
mutate(across(everything(), ~(./5)))
# 3. Impute the value
for(i in colnames(peak_data_mins)){
if(length(peak_data_imputed[,i][is.na(peak_data_imputed[,i])]) > 0){
peak_data_imputed[which(is.na(peak_data_imputed[,i])),i] <- as.numeric(peak_data_mins[i])
}
}
Once we impute the missing data, the Metabolon workflow follows a three-step process for normalization.
A. Each metabolite’s peak data is standardized by the median value of the metabolites observations.
B. Each observation is log10 transformed
C. Each chemical is autoscaled by subtracting the mean from each value and dividing by the standard deviation.
The next two chunks will perform these three steps to produce a dataset called “peak_data_scale”. Peak_data_scale will be the data set used for data exploration in the next section.
In the first chunk we:
Initialize a new peak_data_norm matrix
Create a matrix containing the median value for each metabolite
Divide each value for each metabolite by the median value of that metabolite
Log10 transform all of the values
# 1. Initialize a new peak_data_norm matrix
peak_data_norm <- peak_data_imputed
# 2. Create a matrix containing the median value for each metabolite
peak_data_med <- peak_data_norm %>%
select(-PARENT_SAMPLE_NAME) %>%
summarise_all(median, na.rm = T)
# 3. Divide each value for each metabolite by the median value of that metabolite
for(i in colnames(peak_data_med)){
peak_data_norm[,i] <- peak_data_norm[,i]/peak_data_med[,i]
}
# 4. Log10 transform all of the values
peak_data_log <- peak_data_norm %>%
mutate_if(is.numeric, log10)
For the next chunk, we “auto-scale” the features. We will do this in two steps:
Initialize the new matrix peak_data_scale, this is the matrix that we will use in down stream tasks.
Auto-scale by subtracting the mean and dividing by the standard deviation of each metabolite.
# 1. Initialize the new matrix peak_data_scale
peak_data_scale <- peak_data_log
# 2. Auto-scale
for(i in colnames(peak_data_log)[-1]){
peak_data_scale[,i] <- (peak_data_scale[,i]-mean(peak_data_scale[,i]))/sd(peak_data_scale[,i])
}
Now that we have imputed and normalized our data, we need to save the data for subsequent sections. The data which we will use downstream are:
Meta data (meta_data)
Peak Scaled Data (peak_data_scale)
Chemical Annotation data (chem_data)
We will save these data sets to the “Data/Processed” folder.
# 1. Save meta data
write.csv(meta_data, file = "../Data/Processed/meta_data.csv", row.names = F)
# 2. Save peak scaled data
write.csv(peak_data_scale, "../Data/Processed/peak_data_scale.csv",row.names = F)
# 3. save chemical annotaion data
write.csv(chem_data, file = "../Data/Processed/chem_data.csv", row.names = F)
In data exploration, we use several methods to help us better understand the underlying patterns in the data without using a formal hypothesis test. In this pipeline, we are going to focus on two methods for data exploration:
A.) Principal component analysis
B.) Heatmaps
Before diving into the data exploration, we need to read the processed data from the Normalization and Standardization Section. We will need to read the following files
Meta data: We will need to call this meta_data by default for the other down- stream tasks. If you make any changes to this variable name then you will have to make changes in down-stream chunks.
Scales peak data: We will need to call this peak_data_scale by default for the other down stream-tasks. If you make any changes to this variable name then you will have to make changes in down-stream chunks.
Chemical annotation data: We will need to call this chem_data by default for the other down-stream tasks. If you make any changes to this variable name then you will have to make changes in down-stream chunks.
# 1. Read meta-data
meta_data <- read.csv("../Data/Processed/meta_data.csv", check.names = F)
# 2. Read scaled peak data
peak_data_scale <- read.csv("../Data/Processed/peak_data_scale.csv", check.names = F)
# 3. Read chemical annotations
chem_data <- read.csv("../Data/Processed/chem_data.csv", check.names = F)
In general, Principal component analysis (PCA) reduces the number of variables in a dataset while preserving as much information from the data as possible. At a high level, PCA is constructed such that the first principal component accounts for the largest possible amount of variance within the data. The second principal component accounts for the largest amount of remaining variance, and so on. Additionally, each of the principal components produced by PCA is uncorrelated with each of the other principal components.
Here, we will graph the first two principal components of our dataset. In the PCA plot, we can label each point based on variables from the metadata.
The following chunk runs PCA on the scale_peak_data matrix in the following steps:
Set the meta analysis variables from the metadata that will be used to categorize points on the PCA plot. You must always include the variable name that maps the observations from the scaled_peak_data to the meta_data. This is typically the variable name “PARENT_SAMPLE_NAME”.
Merge the peak_data_scale and the meta_data matrices together to define the pca_dat
Create a list of all of the metabolite names in the newly generated pca_dat matrix.
Run PCA of the pca_dat matrix containing only the metabolites.
Define the plot labels. This is a character string matching the name of one of the variables in the meta data.
Create and display the PCA plot
Save the PCA plot in the “Outputs/Plots” folder with the file name PCA.pdf
# 1. Set the meta analysis variables
meta_analysis_variables <- c("PARENT_SAMPLE_NAME",
"GROUP_NAME",
"TIME1",
"Gender")
# 2. Merge the peak_data_scale and the meta_data matrices together
pca_dat <- peak_data_scale %>%
left_join(meta_data[,meta_analysis_variables],
by = "PARENT_SAMPLE_NAME") %>%
select(all_of(meta_analysis_variables),everything())
# 3. Create a list of all of the metabolite names
chems <- names(pca_dat)[(names(pca_dat)%in% chem_data$CHEM_ID)]
# 4. Run PCA of the pca_dat matrix containing only the metabolites.
res.pca <- PCA(pca_dat[chems],
graph = F)
# 5. Define labels
meta_var = "Gender"
# 6. Create figure
fviz_pca_ind(res.pca,
label = "none",
habillage = as.factor(pca_dat[,meta_var]))
# 7. Save figure
ggsave(paste0("../Outputs/Plots/","PCA.pdf"), width = 10, height = 10)
Suppose you notice a variable with clearly separated groups, and is not a variable of interest. In that case, you may want to consider stratifying your analysis downstream by the values of that variable. For example, if we are looking at the effects of a specific drug, and we notice in the above plot that the samples are grouped by sex, then we may want to consider stratifying the analysis by male and female.
Another exploratory analysis tool we can use is heatmaps, a gridded plot based on the x-axis- and y-axis labels. The color of the spot on the grid is based on the value’s intensity. For our heatmap, the x-axis will be the samples, and the y-axis will be the metabolites. The values determining the colors will be the log10 scaled peak value for each chemical in each observation. We can order our observations to see intensity patterns based on our experimental conditions. To do this, we need to merge the chemical annotation, meta, and scaled peak data. Then we need to arrange the data based on our experimental conditions. We have developed an R function (create_heatmap) that does this in one step. The create_heatmap function takes the following arguments.
The first three arguments of the function are data frames (data frame names are in parenthesis). The meta-analysis variables are specified in the meta_analysis_variables, which contains all of the variables from the metadata that you would like to include in the heatmap.
The main utility of create_heatmap data is in (…), which allows you to arrange the data based on experimental conditions. If you are familiar with dplyr, the arrange function orders samples based on the arguments passed into (…). If you are unfamiliar with dplyr, an overview of the arrange function is here.
We will first look at the heatmap with all metabolites. The resulting heatmap saves into the Outputs/Plots folder under the filename “AllMetabolites.pdf.”
In this chunk we complete the following steps.
Define the meta analysis variables that will be used to order the heatmap
Run the create_heatmap function defined in the R script folder. This function is going to produce the matrices needed for the heatmap
Create a palette for the heatmap. By default we are using a red/blue palette.
Create the vals and meta matrices that will define the values of the heatmap and the labels of the heatmap.
Display heatmap
Save heatmap in the “Outputs/Plots” folder under the filename “Heatmap_all”
# 1. define meta analysis variables
meta_analysis_variables <- meta_analysis_variables <- c("PARENT_SAMPLE_NAME",
"GROUP_NAME",
"TIME1",
"Gender")
# 2. Create heatmap data
heatmap_data <- create_heatmap(peak_data_scale,chem_data,meta_data, meta_analysis_variables,
Gender, GROUP_NAME, desc(TIME1)) # Arranges data frame for heatmap (...)
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(meta_analysis_variables)
##
## # Now:
## data %>% select(all_of(meta_analysis_variables))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# 3. Heat map colors
palette <- colorRampPalette(rev(brewer.pal(10, "RdBu")))(256)
# 4. Values for heatmap
vals = heatmap_data$heatmap_data_vals
meta = heatmap_data$heatmap_meta_data
# 5. Create heatmap and save.
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
annotation_col = meta, show_rownames = F, border_color = NA)
# 6. Save heatmap
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
annotation_col = meta, show_rownames = F, border_color = NA,filename = paste0("../Outputs/Plots/","Heatmap_all.pdf"))
Looking at the heatmap created in the previous chunk with all of the metabolites can make it challenging to determine which line goes with which metabolite. To overcome this, we can filter the metabolites based on which metabolites have the highest mean log-scaled peak data across all observations. In the heatmap below, we select the top 50 metabolites. Once we know the top 50 metabolites with the highest mean value, we can filter the peak_data_log data only to include those metabolites. Then we can run the create_heatmap function again and create the heatmap data. To do this the following steps are completed:
Define the meta analysis variables to use in the heatmap
Select the top 50 metabolites from the scaled peak data
Filter the scaled_peak_data to only include the top 50 metabolites
Generate heatmap data using the create_heatmap function
Create a palette for the heatmap. By default, we are using a red/blue palette.
Create the vals and meta matrices that will define the values of the heatmap and the labels of the heatmap.
Display heatmap
Save heatmap in the “Outputs/Plots” folder under the filename “HeatmapTop50Metabolites”
# 1. Define meta analysis variables
meta_analysis_variables <- c("PARENT_SAMPLE_NAME","GROUP_NAME", "TIME1", "Gender")
# 2. Find the top 50 metabolites
select_variables <- peak_data_scale %>%
select(-PARENT_SAMPLE_NAME) %>%
summarise(across(everything(), mean, na.rm = T)) %>%
pivot_longer(cols = everything()) %>%
arrange(desc(value)) %>%
slice(c(1:50))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(everything(), mean, na.rm = T)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
##
## # Previously
## across(a:b, mean, na.rm = TRUE)
##
## # Now
## across(a:b, \(x) mean(x, na.rm = TRUE))
# 3. Filter to the top 50 metabolites
peak_data_scale_top50 <- peak_data_scale %>%
select(PARENT_SAMPLE_NAME, select_variables$name)
# 4. Create heatmap data
heatmap_data <- create_heatmap(peak_data_scale_top50 ,chem_data,meta_data, meta_analysis_variables,
Gender, GROUP_NAME, desc(TIME1)) # Arranges data frame for heatmap
# 5. Heat map colors
palette <- colorRampPalette(rev(brewer.pal(10, "RdBu")))(256)
# 6. Values for heatmap
vals = heatmap_data$heatmap_data_vals
meta = heatmap_data$heatmap_meta_data
# 7. Create heatmap and save.
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
annotation_col = meta, show_rownames = F, border_color = NA)
# 8. Save heatmap
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
annotation_col = meta, show_rownames = F, border_color = NA,filename = paste0("../Outputs/Plots/","HeatmapTop50Metabolites.pdf"))
We may also want to stratify our heatmap based on a specific variable within our data set. To do this, we have created the filter_heatmap function, which takes the output from the create_heatmap function and filters the heatmap data. The filter_heatmap function takes three arguments.
heatmap_data - List of data frames from the create_heatmap function.
drop_vars - Any metadata variables in the heatmap data that you would like to drop from the analysis. For instance, if we are looking at only male samples we may want to drop the gender variable because it is no longer informative.
(…) arguments for the dplyr filter function. This will allow you to filter the data based on the stratification you want. For example: Gender==“Male”
In the chunk below we are going to produce a heatmap for a single strata by:
Defining meta analysis variables
Selecting the top 50 metabolites
Filter the scaled_peak_data on the top 50 metabolites
Generate the heatmap data with the create_heatmap function
Filter the generated heatmap data on a specific strata using the filter_heatmap function.
Create a palette for the heatmap. By default we are using a red/blue palette.
Create the vals and meta matrices that will define the values of the heatmap and the labels of the heatmap.
Display heatmap
Save heatmap in the “Outputs/Plots” folder under the filename “HeatmapTop50MetabolitesStrat1”. This filename should be changed to reflect the name of the strata instead of “Strat1”.
# define meta analysis variables
meta_analysis_variables <- c("PARENT_SAMPLE_NAME","GROUP_NAME", "TIME1", "Gender")
# Find the top 50 metabolites by log peak area
select_variables <- peak_data_scale %>%
select(-PARENT_SAMPLE_NAME) %>%
summarise(across(everything(), mean, na.rm = T)) %>%
pivot_longer(cols = everything()) %>%
arrange(desc(value)) %>%
slice(c(1:50))
# Filter to the top 50 metabolites
peak_data_scale_top50 <- peak_data_scale %>%
select(PARENT_SAMPLE_NAME, select_variables$name)
# Create heatmap data
heatmap_data <- create_heatmap(peak_data_scale_top50 ,chem_data,meta_data, meta_analysis_variables,
Gender, GROUP_NAME, desc(TIME1)) # Arranges data frame for heatmap
# Filter heatmap data to males only
Filtered_heatmap <- filter_heatmap(heatmap_data,drop_vars = "Gender",
Gender=="Male") # identify the strata you want
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(drop_vars)
##
## # Now:
## data %>% select(all_of(drop_vars))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# to visualize.
# Heat map colors
palette <- colorRampPalette(rev(brewer.pal(10, "RdBu")))(256)
# Values for heatmap
vals = Filtered_heatmap$heatmap_data_vals
meta = Filtered_heatmap$heatmap_meta_data
# Create heatmap and save.
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
annotation_col = meta, show_rownames = F, border_color = NA)
# Save heatmap
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
annotation_col = meta, show_rownames = F, border_color = NA,filename = paste0("../Outputs/Plots/","HeatmapTop50MetabolitesStrat1.pdf"))
Before diving in, lets load the matrices that we will need for the Sub-pathway Analysis section. The three matrices we will need are:
Meta analysis data (meta_data): Contains all of the information about the samples.
Scaled peak data (peak_data_scale): The is the auto-scaled peak data derived in the “Normalization and Standardization” section.
Chemical annotation data (chem_data): Contains all of the information about the metabolites, including which sub-pathway and super-pathway they belong to.
# 1. Read meta-data
meta_data <- read.csv("../Data/Processed/meta_data.csv", check.names = F)
# 2. Read scaled peak data
peak_data_scale <- read.csv("../Data/Processed/peak_data_scale.csv", check.names = F)
# 3. Read chemical annotations
chem_data <- read.csv("../Data/Processed/chem_data.csv", check.names = F)
In the chemical annotation file, we will see that each metabolite is within a sub-pathway, and each sub-pathway is within a super-pathway. There are several metabolites within each sub-pathways and several sub-pathways within each Super-pathway. We can utilize an Analysis of variance (ANOVA) model to test for a difference in peak intensities between the treatment groups at the metabolite level, which is part of the Metabolon workflow. However, since multiple metabolites are within a sub-pathway, it is challenging to test if the treatment affected the peak data at the sub-pathway level. For this, we will be utilizing a combined fisher test. The combined Fisher test combines the p-values for each metabolite within a sub-pathway to test the same hypothesis at the sub-pathway level.
To test our hypothesis at the sub-pathway level, we first have to form our hypothesis at the metabolite level. For each metabolite, we test three models.
1.) Interaction: \(log Peak = Treatment + Var2 + Treatment*Var2\)
2.) Parallel: $log Peak = Treatment + Var2 $
3.) Single: \(log Peak = Treatment\)
We are focusing only on the interaction term “Treatment*Var2” for this interaction model to test if there is a significant interaction between our treatment and the second variable. The parallel model is testing if the second variable is explaining a significant amount of variance in the peak scaled intensity, and the treatment model is testing if the treatment explains a significant proportion of the variance in the scaled peak intensity for each metabolite.
After running these three models for each metabolite, we will test at the sub-pathway level by combining the p-values of each model for each metabolite within the sub-pathway. We compute a chi-squared statistic to test at the sub-pathway level. For each model, we compute the chi-squared statistic by:
\[ \tilde{X} = -2\sum_{i=1}^k ln(p_i) \] where \(k\) is the number of metabablites in the sub-pathway. We can get a p-value from \(P(X \geq\tilde{X})\), knowing that \(\tilde{X}\sim \chi^2_{2k}\).
Since we are first testing each metabolite utilizing ANOVA, we make the following assumptions for each metabolite,
Independence: Each observation is independent of all other observations. Therefore, if you have collected multiple samples from the same subject then this assumption may be violated.
Normality: The metabolite log-scaled intensities follow a normal distribution within each of the treatment groups.
Homoscedasticity: Equal variance between treatment groups.
In addition to the assumptions in the ANOVA models at the metabolite level, the Fisher’s Combined probability places an independence assumptions for each metabolite being tested within the sub-pathway. For example, when we test the interaction model at the sub-pathway level, the p-values for interaction test are independent for each metabolite within the sub-pathway.
Before we run the combined fisher analysis, we need to create the analysis data, which will contain all of the information required for the analysis. This data set includes the scaled peak data, and metadata all in one data frame. We will create this data and then save it to the Data/Processed folder. By default, this data will save as “analysis.csv”. You can change the saved name in the write.csv function. Therefore, in the following chunk we:
Define the meta analysis variables to include in the analysis data set.
Create the analysis data by merging the meta data, scaled peak data and the chemical annotation data.
Save the analysis data to the “Data/Processed/” folder under the file name “analysis_data”
# 1. Define meta analysis variables
meta_analysis_variables <- meta_analysis_variables <- c("PARENT_SAMPLE_NAME",
"GROUP_NAME",
"TIME1",
"Gender")
# 2. Create analysis data
analysis_data <- peak_data_scale %>%
pivot_longer(cols = -PARENT_SAMPLE_NAME,
names_to = "CHEM_ID") %>%
left_join(select(chem_data, CHEM_ID, CHEMICAL_NAME) %>%
mutate(CHEM_ID= as.character(CHEM_ID)),
by = "CHEM_ID") %>%
left_join(select(meta_data, meta_analysis_variables),
by = "PARENT_SAMPLE_NAME") %>%
pivot_wider(id_cols = meta_analysis_variables,
names_from = CHEMICAL_NAME, values_from = value)
# 3. Save the analysis data to Data/Processed folder
write.csv(analysis_data, file = "../Data/Processed/analysis_data.csv", row.names = F)
Now that we have our analysis data, we test the metabolomic data at
the sub-pathway level utilizing the ancova_function defined
in the R script. This function takes the following arguments:
Anova data (anova_data): The data frame containing the combined meta, peak and chemical annotation data.
Meta analysis variables (meta_analysis_variables): The metadata variables which includes the treatment groups and any other variables from the metadata. As a note this should always include the “PARENT_SAMPLE_NAME” variable which is used to merge the data frames.
Treatment only (treatment.only): This is a TRUE/FALSE. This should be set to TRUE if you are only including a treatment variable and set to false if you are including more than 1 variable. Setting this to TRUE will only run test the metabolites and the sub-pathways with the Single model.
Treatment variable (treat_var): The name of the main treatment variable that needs to be tested.
Additive variables (additive_vars): Defines which variables from the meta_analysis_variables to include in the analysis.
Interaction variables (interaction_vars): Defines the interactions to include in the modeling. The set of variables to include in the interactions must be a subset of the variables from the additive_vars. An example of how this should look look is \(c("var1*var2")\). This will create interactions between var1 and var2.
Note: The ancova_function defined in the R script can only handle the modeling with two variables. If additional variables are added, some modifications will need to be made in the R script.
To run this analysis, we utilize the following steps.
Read in the analysis data and the chemical annotation data.
Define meta analysis variables to use in the modeling.
Run the ancova_function with the arguments listed above
Save the pathway analysis results to the “Data/Results” folder with the name “NonStratified_Full_ancova_results”
# 1. Read in Analysis data and chemical annotation data
analysis_data <- read.csv("../Data/Processed/analysis_data.csv", check.names = F)
chem_data <- read.csv("../Data/Processed/chem_data.csv", check.names = F)
# 2. Define meta analysis variables
meta_analysis_variables <- meta_analysis_variables <- c("PARENT_SAMPLE_NAME",
"GROUP_NAME",
"TIME1",
"Gender")
# Run the ancova function
path_dat <- ancova_function(analysis_data,
chem_data = chem_data,
meta_analysis_variables = meta_analysis_variables,
treatment.only = F,
additive_vars=c("GROUP_NAME","TIME1"),
interaction_vars =c("GROUP_NAME*TIME1"),
treat_var = "GROUP_NAME")
# Save results
write.csv(path_dat,file = "../Data/Results/NonStratified_Full_ancova_results.csv", row.names = F)
We may notice that we need to stratify our analysis if we believe the effects of the model are different between the levels of the stratified variable. For example, we may notice that sex can change the effects of our treatment, and we may want to look at males and females separately. One way to do this is to subset the data prior to running the ancova function.
In the following chunk you can specify a variable to stratify by. For each level of the stratified variable, we:
1.) Subset the analysis data based on the stratified variable.
2.) Run the ancova function with the subsetted data.
3.) Save the stratified data in “Data/Results” as “analysis_data_{Val}” where {Val} is the value of the stratified variable. For example, if we stratified by sex then the data will be saved as “analysis_data_female for the females.
4.) Save the results in “Data/Results” and the file will be saved as “Statified_{Val}_ancova_results.csv” where {Val} is the value of the variable we stratified by. For example, if we stratified by sex, then the results for females would be saved as “Stratified_female_ancova_results.csv”
# 1. Read in analysis data
analysis_data <- read.csv("../Data/Processed/analysis_data.csv", check.names = F)
# define meta analysis variables
#meta_analysis_variables <- meta_analysis_variables <- c("PARENT_SAMPLE_NAME",
# "GROUP_NAME",
# "TIME1",
# "Gender")
# Defined stratified variable.
stratified_var = "Gender"
# Find unique values of this variable
uni_vals <- unique(analysis_data[,stratified_var])
# For each value perform the four steps listed above.
for(i in uni_vals){
# 1. Subset analysis data
strat_data = analysis_data[analysis_data[,stratified_var]==i,]
# 2. Run ancova function on the subsetted data
strat_path_results <- ancova_function(strat_data,
chem_data = chem_data,
meta_analysis_variables = meta_analysis_variables,
treatment.only = F,
additive_vars=c("GROUP_NAME","TIME1"),
interaction_vars =c("GROUP_NAME*TIME1"),
treat_var = "GROUP_NAME")
# 3 Save stratified data
write.csv(strat_data, paste0("../Data/Processed/analysis_data_",i,".csv"), row.names = F)
# 4 Save results
write.csv(strat_path_results, paste0("../Data/Results/Stratified_",i,"_ancova_results.csv"), row.names = F)
}
Once we have the Overall Analysis results, we will have tested all metabolites with an interaction, parallel and Single model. We then obtain a p-values through the combined fisher analysis for each sub-pathway. In the next few chunks, we will summarize the results with a few different tables. The first summary will show the number of significant sub-pathways for each model type. To do this we:
Read in the results from the overall analysis. Currently we are reading in the results from one of the strata’s in the stratified analysis. If you did not stratify the analysis you should change the path to the non stratified results. Additionally, if you stratified your analysis you may want to copy and paste this chunk to repeat this table for each strata.
Identify if the analysis only included a treatment variable. If you only included a treatment variable, then treatment_only should be set to true.
Structure the names of the levels for the model types in the table.
Create the table data used to count the number of significant subpathways for each model type.
Create the table by counting the number of sub-pathways for each model type.
Display the table
Save the table in “Outputs/Tables” folder under the file name “NumberOfSigPathwaysByModelType_males”. You may want to change this name to reflect the strata you are summarizing.
Note: If you only tested the sub-pathways for the Single model then only treatment is displayed.
Note: The model type for each sub-pathway is determined hiearchically. For instance, if a sub-pathway shows a significant p-value for the interaction model and the parallel model, then the sub-pathway model type is only “interaction”.
# 1. Read in Results from the analysis step.
path_data <- read.csv("../Data/Results/Stratified_Male_ancova_results.csv", check.names = F)
# 2. Treatment only: Change this to TRUE if only the treatment was tested.
treatment_only = F
# 3. Structure levels
if(treatment_only){
levs <- c("Single","None")
}
if(!treatment_only){
levs = c("Interaction", "Parallel", "Single", "None")
}
# 4. Create table data
table_data <- path_data %>%
mutate(model = factor(model, levels = levs)) %>%
select(sub_pathway, ends_with("_fisher"), model) %>%
distinct()
# 5. Create table
sig_subPaths <- count(table_data, model) %>%
arrange((model)) %>%
flextable() %>%
set_header_labels(model = "Model Type", n="Count")%>%
theme_vanilla() %>%
set_table_properties(layout = "autofit")
# 6. Display table
sig_subPaths
Model Type | Count |
|---|---|
Interaction | 81 |
Parallel | 15 |
Single | 3 |
None | 10 |
# 7. Save table
save_as_docx(sig_subPaths,path="../Outputs/Tables/NumberOfSigPathwaysByModelType_males.docx")
# 8. Save table_data
write.csv(table_data,"../Outputs/Tables/Table_data.csv", row.names = F)
Additionally, we can summarize the results by looking at the number of significant sub-pathways within a super-pathway for each model type. This summary is done by:
Read in the table data and chem data.
Structuring the model levels.
Create the table data used to count the number of significant sub-pathways for each model type.
Formulating the superpath data by:
Filtering all subpathways that did not have a significant subpathway.
Join the results with the chemical annoation data to group by the superpathway.
Summarize the percent of subpathways within superpathways that are significant.
Display table
Save table in the “Outputs/Tables” folder under the name “SigSuperPathwayPecentages”
Note: The model type for each sub-pathway is determined hierarchically. For instance, if a sub-pathway shows a significant p-value for the interaction model and the parallel model, then the sub-pathway model type is only “interaction”.
# 1. Read in table data and chem data
table_data <- read.csv("../Outputs/Tables/Table_data.csv")
chem_data <- read.csv("../Data/Processed/chem_data.csv", check.names = F)
# 2. Structure Cases
if(sum(grepl("interaction",names(table_data))==0)){
cases <- expression( case_when(model == "Single" ~ single_fisher))
}
if(sum(grepl("interaction",names(table_data)))>0){
cases <- expression( case_when(model == "Interaction" ~ interaction_fisher,
model == "Parallel" ~ parallel_fisher,
model == "Single" ~ single_fisher))
}
# 3. Formulate the Superpathway results table.
superPath <- table_data %>%
filter(model != "None") %>%
mutate(pval = eval(cases)) %>%
select(sub_pathway, pval) %>%
right_join(chem_data %>% select(SUPER_PATHWAY, SUB_PATHWAY) %>% distinct(),
by = c("sub_pathway" = "SUB_PATHWAY")) %>%
filter(!is.na(sub_pathway)) %>%
mutate(sig = ifelse(is.na(pval), 0, 1)) %>%
group_by(SUPER_PATHWAY) %>%
summarise(percent_significant = round(mean(sig) * 100, 2),
number_significant = sum(sig),
pathway_count = n()) %>%
ungroup() %>% arrange(-percent_significant) %>%
transmute(SUPER_PATHWAY, percent_significant = paste0(number_significant, " / ", pathway_count,
" (", percent_significant, "%)")) %>%
flextable() %>%
set_header_labels(SUPER_PATHWAY="Super Pathway", percent_significant = "Percent Significant") %>%
set_table_properties(layout = "autofit") %>%
theme_vanilla()
# 4. Display table
superPath
Super Pathway | Percent Significant |
|---|---|
Amino Acid | 16 / 16 (100%) |
Energy | 2 / 2 (100%) |
Nucleotide | 8 / 8 (100%) |
Partially Characterized Molecules | 1 / 1 (100%) |
Xenobiotics | 5 / 5 (100%) |
Cofactors and Vitamins | 10 / 11 (90.91%) |
Carbohydrate | 7 / 8 (87.5%) |
Lipid | 45 / 53 (84.91%) |
Peptide | 4 / 5 (80%) |
# 5. Save table
save_as_docx(superPath,path = "../Outputs/Tables/SigSuperPathwayPecentages.docx")
We may also be interested in obtaining a table which contains all of the sub-pathways that had a significant model regardless of model type. To do this we create a table in the following steps.
Read in the table data
Structure model type levels.
Create pathway data that contains the sub-pathway, model type and the p-value for the model type with the the pathways that were non-significant filtered out.
Format table
Display table
Save table in the “Outputs/Tables” folder with the name “SigSubpathwayTable”
# 1. Read in table data
table_data <- read.csv("../Outputs/Tables/Table_data.csv")
# 2. Structure model type levels
if(sum(grepl("interaction",names(table_data))==0)){
cases <- expression( case_when(model == "Single" ~ single_fisher))
}
if(sum(grepl("interaction",names(table_data)))>0){
cases <- expression( case_when(model == "Interaction" ~ interaction_fisher,
model == "Parallel" ~ parallel_fisher,
model == "Single" ~ single_fisher))
}
# 3. Create pathway table
pathway_table <- table_data %>%
filter(model != "None") %>%
mutate(pval = eval(cases)) %>%
select(sub_pathway, model, pval) %>%
arrange(model, pval)
# 5. Format table
path_tab = pathway_table %>%
flextable() %>%
set_header_labels(sub_pathway="Subpathway", model = "Model Type", pval = "P-value") %>%
set_table_properties(layout = "autofit") %>%
colformat_double(digits = 3) %>%
theme_vanilla()
# Display table
path_tab
Subpathway | Model Type | P-value |
|---|---|---|
Glutamate Metabolism | Interaction | 0.000 |
Polyamine Metabolism | Interaction | 0.000 |
Fatty Acid, Dihydroxy | Interaction | 0.000 |
Tryptophan Metabolism | Interaction | 0.000 |
Lysine Metabolism | Interaction | 0.000 |
TCA Cycle | Interaction | 0.000 |
Leucine, Isoleucine and Valine Metabolism | Interaction | 0.000 |
Glycolysis, Gluconeogenesis, and Pyruvate Metabolism | Interaction | 0.000 |
Primary Bile Acid Metabolism | Interaction | 0.000 |
Purine Metabolism, (Hypo)Xanthine/Inosine containing | Interaction | 0.000 |
Long Chain Polyunsaturated Fatty Acid (n3 and n6) | Interaction | 0.000 |
Medium Chain Fatty Acid | Interaction | 0.000 |
Methionine, Cysteine, SAM and Taurine Metabolism | Interaction | 0.000 |
Chemical | Interaction | 0.000 |
Purine Metabolism, Adenine containing | Interaction | 0.000 |
Urea cycle; Arginine and Proline Metabolism | Interaction | 0.000 |
Alanine and Aspartate Metabolism | Interaction | 0.000 |
Sterol | Interaction | 0.000 |
Phospholipid Metabolism | Interaction | 0.000 |
Creatine Metabolism | Interaction | 0.000 |
Secondary Bile Acid Metabolism | Interaction | 0.000 |
Gamma-glutamyl Amino Acid | Interaction | 0.000 |
Food Component/Plant | Interaction | 0.000 |
Fatty Acid, Dicarboxylate | Interaction | 0.000 |
Purine Metabolism, Guanine containing | Interaction | 0.000 |
Long Chain Saturated Fatty Acid | Interaction | 0.000 |
Pyrimidine Metabolism, Orotate containing | Interaction | 0.000 |
Glutathione Metabolism | Interaction | 0.000 |
Long Chain Monounsaturated Fatty Acid | Interaction | 0.000 |
Vitamin B6 Metabolism | Interaction | 0.000 |
Vitamin A Metabolism | Interaction | 0.000 |
Pyrimidine Metabolism, Uracil containing | Interaction | 0.000 |
Pentose Metabolism | Interaction | 0.000 |
Pyrimidine Metabolism, Cytidine containing | Interaction | 0.000 |
Tocopherol Metabolism | Interaction | 0.000 |
Endocannabinoid | Interaction | 0.000 |
Aminosugar Metabolism | Interaction | 0.000 |
Fatty Acid, Monohydroxy | Interaction | 0.000 |
Glycerolipid Metabolism | Interaction | 0.000 |
Ceramides | Interaction | 0.000 |
Phosphatidylethanolamine (PE) | Interaction | 0.000 |
Phosphatidylinositol (PI) | Interaction | 0.000 |
Phosphatidylcholine (PC) | Interaction | 0.000 |
Sphingomyelins | Interaction | 0.000 |
Monoacylglycerol | Interaction | 0.000 |
Lysoplasmalogen | Interaction | 0.000 |
Lysophospholipid | Interaction | 0.000 |
Fatty Acid Metabolism (Acyl Carnitine, Long Chain Saturated) | Interaction | 0.000 |
Fatty Acid Metabolism (Acyl Carnitine, Medium Chain) | Interaction | 0.000 |
Ascorbate and Aldarate Metabolism | Interaction | 0.000 |
Fatty Acid Metabolism (Acyl Glycine) | Interaction | 0.000 |
Fatty Acid Metabolism (Acyl Carnitine, Monounsaturated) | Interaction | 0.000 |
Partially Characterized Molecules | Interaction | 0.000 |
Hexosylceramides (HCER) | Interaction | 0.000 |
Fatty Acid, Branched | Interaction | 0.000 |
Diacylglycerol | Interaction | 0.000 |
Fatty Acid Metabolism (Acyl Carnitine, Polyunsaturated) | Interaction | 0.000 |
Fatty Acid Metabolism (Acyl Carnitine, Hydroxy) | Interaction | 0.000 |
Fatty Acid, Amino | Interaction | 0.000 |
Plasmalogen | Interaction | 0.000 |
Dihydrosphingomyelins | Interaction | 0.000 |
Lactoyl Amino Acid | Interaction | 0.000 |
Unknown Metabolite | Interaction | 0.000 |
Pentose Phosphate Pathway | Interaction | 0.000 |
Glycosyl PE | Interaction | 0.003 |
Nicotinate and Nicotinamide Metabolism | Interaction | 0.010 |
Tyrosine Metabolism | Interaction | 0.010 |
Sphingosines | Interaction | 0.010 |
Sphingolipid Synthesis | Interaction | 0.010 |
Glycine, Serine and Threonine Metabolism | Interaction | 0.010 |
Thiamine Metabolism | Interaction | 0.010 |
Dipeptide | Interaction | 0.010 |
Fatty Acid Metabolism (Acyl Carnitine, Dicarboxylate) | Interaction | 0.010 |
Phenylalanine Metabolism | Interaction | 0.020 |
Histidine Metabolism | Interaction | 0.020 |
Disaccharides and Oligosaccharides | Interaction | 0.029 |
Fructose, Mannose and Galactose Metabolism | Interaction | 0.030 |
Ketone Bodies | Interaction | 0.034 |
Mevalonate Metabolism | Interaction | 0.040 |
Riboflavin Metabolism | Interaction | 0.040 |
Fatty Acid Metabolism (also BCAA Metabolism) | Interaction | 0.040 |
Hemoglobin and Porphyrin Metabolism | Parallel | 0.000 |
Drug - Topical Agents | Parallel | 0.000 |
Pyrimidine Metabolism, Thymine containing | Parallel | 0.000 |
Eicosanoid | Parallel | 0.000 |
Benzoate Metabolism | Parallel | 0.000 |
Tetrahydrobiopterin Metabolism | Parallel | 0.000 |
Guanidino and Acetamido Metabolism | Parallel | 0.000 |
Phosphatidylserine (PS) | Parallel | 0.000 |
Dipeptide Derivative | Parallel | 0.000 |
Glycogen Metabolism | Parallel | 0.001 |
Docosanoid | Parallel | 0.001 |
Acetylated Peptides | Parallel | 0.040 |
Purine and Pyrimidine Metabolism | Parallel | 0.044 |
Biotin Metabolism | Parallel | 0.049 |
Drug - Other | Parallel | 0.049 |
Phosphatidylglycerol (PG) | Single | 0.001 |
Oxidative Phosphorylation | Single | 0.026 |
Dihydroceramides | Single | 0.040 |
# Save table
save_as_docx(path = "../Outputs/Tables/SigSubpathwayTable.docx")
For the subpathways with a significant model, it can also be useful to look at the results for the significant model for each metabolite within the significant subpathway. In the following chunk we look at the metabolites within the top 2 most significant sub-pathways for the interaction model and the parallel model. In the following chunk we:
Read in the table data and results from the overall analysis.
Structure model type levels.
Create the “top_pathway_table” which filters the table data to only focus on metabolites with a significant model. Orders the data by model type and pvalue and then selects the top five sub-pathways. Finally, we join this with the results from the overall model to add metabolites and p-values for each of the top five significant sub-pathways.
For the top two sub-pathways with a significant interaction and parallel model we do the following steps:
Select the top two sub-pathways. If you want specific sub-pathways you can provide a list of the interested sub-pathways by changing inter_paths and par_paths to c(“subpathway_1”, “subpathway_2”, …).
Filter the “top_pathway_table” on each of the selected sub-pathways.
Display the metabolite name and p-value for the model
Each table will be displayed and saved to the “Outputs/Tables” folder under the name metabolitesInTop{{model type}}Pathway.
# 1. Read in table data and results from overall analysis.
table_data <- read.csv("../Outputs/Tables/Table_data.csv")
path_data <- read.csv("../Data/Results/Stratified_Male_ancova_results.csv", check.names = F)
# 2. Structure model type levels
if(sum(grepl("interaction",names(table_data))==0)){
cases <- expression( case_when(model == "Single" ~ single_fisher))
}
if(sum(grepl("interaction",names(table_data)))>0){
cases <- expression( case_when(model == "Interaction" ~ interaction_fisher,
model == "Parallel" ~ parallel_fisher,
model == "Single" ~ single_fisher))
}
# 3. Create top pathways table
top_pathway_table <- table_data %>%
filter(model != "None") %>%
mutate(pval = eval(cases)) %>%
select(sub_pathway, model, pval) %>%
arrange(model, pval) %>%
group_by(model) %>% slice(1:5) %>% ungroup() %>%
left_join(select(path_data, sub_pathway, chem_name, ends_with("_pval")),
by = "sub_pathway") %>%
distinct()
# 4. Create and save tables
# Get top two pathways
inter_paths = unique(top_pathway_table[top_pathway_table$model=="Interaction",]$sub_pathway)[1:2]
# Top pathway (Display and Save)
top_pathway_table %>%
filter(sub_pathway == inter_paths[1]) %>%
select(chem_name, interaction_pval) %>%
arrange(interaction_pval) %>%
flextable() %>%
set_header_labels(chem_name="Metabolite", interaction_pval="Interaction P-Value") %>%
set_table_properties(layout = "autofit") %>%
add_header_row(values = c(inter_paths[1]),colwidths = 2) %>%
theme_vanilla() %>%
align(i=1,align = "center", part = "header") %>%
colformat_double(digits = 3)
Glutamate Metabolism | |
|---|---|
Metabolite | Interaction P-Value |
glutamine | 0.023 |
carboxyethyl-GABA | 0.061 |
alpha-ketoglutaramate* | 0.072 |
N-methyl-GABA | 0.076 |
N-acetylglutamine | 0.091 |
N-acetylglutamate | 0.091 |
N-acetyl-aspartyl-glutamate (NAAG) | 0.093 |
glutamate | 0.233 |
4-hydroxyglutamate | 0.248 |
S-1-pyrroline-5-carboxylate | 0.356 |
top_pathway_table %>%
filter(sub_pathway == inter_paths[1]) %>%
select(chem_name, interaction_pval) %>%
arrange(interaction_pval) %>%
flextable() %>%
set_header_labels(chem_name="Metabolite", interaction_pval="Interaction P-Value") %>%
set_table_properties(layout = "autofit") %>%
add_header_row(values = c(inter_paths[1]),colwidths = 2) %>%
theme_vanilla() %>%
align(i=1,align = "center", part = "header") %>%
colformat_double(digits = 3) %>%
save_as_docx(path = "../Outputs/Tables/metabolitesInTop1InteractionPathway.docx")
# Second Pathway
top_pathway_table %>%
filter(sub_pathway == inter_paths[2]) %>%
select(chem_name, interaction_pval) %>%
arrange(interaction_pval) %>%
flextable() %>%
set_header_labels(chem_name="Metabolite", interaction_pval="Interaction P-Value") %>%
set_table_properties(layout = "autofit") %>%
add_header_row(values = c(inter_paths[2]),colwidths = 2) %>%
theme_vanilla() %>%
align(i=1,align = "center", part = "header") %>%
colformat_double(digits = 3)
Polyamine Metabolism | |
|---|---|
Metabolite | Interaction P-Value |
(N(1) + N(8))-acetylspermidine | 0.034 |
N-acetylputrescine | 0.042 |
spermidine | 0.061 |
spermine | 0.091 |
5-methylthioadenosine (MTA) | 0.131 |
putrescine | 0.304 |
4-acetamidobutanoate | 0.442 |
top_pathway_table %>%
filter(sub_pathway == inter_paths[2]) %>%
select(chem_name, interaction_pval) %>%
arrange(interaction_pval) %>%
flextable() %>%
set_header_labels(chem_name="Metabolite", interaction_pval="Interaction P-Value") %>%
set_table_properties(layout = "autofit") %>%
add_header_row(values = c(inter_paths[1]),colwidths = 2) %>%
theme_vanilla() %>%
align(i=1,align = "center", part = "header") %>%
colformat_double(digits = 3) %>%
save_as_docx(path = "../Outputs/Tables/metabolitesInTop2InteractionPathway.docx")
# parallel --------- Top 2 paralell
par_paths = unique(top_pathway_table[top_pathway_table$model=="Parallel",]$sub_pathway)[1:2]
# Top 1 Parallel
top_pathway_table %>%
filter(sub_pathway == par_paths[1]) %>%
select(chem_name, parallel_pval) %>%
arrange(parallel_pval) %>%
flextable() %>%
set_header_labels(chem_name="Metabolite", parallel_pval="Parallel P-Value") %>%
set_table_properties(layout = "autofit") %>%
add_header_row(values = c(par_paths[1]),colwidths = 2) %>%
theme_vanilla() %>%
align(i=1,align = "center", part = "header") %>%
colformat_double(digits = 3)
Hemoglobin and Porphyrin Metabolism | |
|---|---|
Metabolite | Parallel P-Value |
heme | 0.000 |
bilirubin (E,E)* | 0.028 |
biliverdin | 0.556 |
top_pathway_table %>%
filter(sub_pathway == par_paths[1]) %>%
select(chem_name, parallel_pval) %>%
arrange(parallel_pval) %>%
flextable() %>%
set_header_labels(chem_name="Metabolite", interaction_pval="Parallel P-Value") %>%
set_table_properties(layout = "autofit") %>%
add_header_row(values = c(par_paths[1]),colwidths = 2) %>%
theme_vanilla() %>%
align(i=1,align = "center", part = "header") %>%
colformat_double(digits = 3) %>%
save_as_docx(path = "../Outputs/Tables/metabolitesInTop1ParallelPathway.docx")
# Top 2 parallel
top_pathway_table %>%
filter(sub_pathway == par_paths[2]) %>%
select(chem_name, parallel_pval) %>%
arrange(parallel_pval) %>%
flextable() %>%
set_header_labels(chem_name="Metabolite", parallel_pval="Parallel P-Value") %>%
set_table_properties(layout = "autofit") %>%
add_header_row(values = c(par_paths[2]),colwidths = 2) %>%
theme_vanilla() %>%
align(i=1,align = "center", part = "header") %>%
colformat_double(digits = 3)
Drug - Topical Agents | |
|---|---|
Metabolite | Parallel P-Value |
salicylate | 0.006 |
2,6-dihydroxybenzoic acid | 0.011 |
hydroquinone sulfate | 0.062 |
top_pathway_table %>%
filter(sub_pathway == par_paths[2]) %>%
select(chem_name, parallel_pval) %>%
arrange(parallel_pval) %>%
flextable() %>%
set_header_labels(chem_name="Metabolite", interaction_pval="Parallel P-Value") %>%
set_table_properties(layout = "autofit") %>%
add_header_row(values = c(par_paths[2]),colwidths = 2) %>%
theme_vanilla() %>%
align(i=1,align = "center", part = "header") %>%
colformat_double(digits = 3) %>%
save_as_docx(path = "../Outputs/Tables/metabolitesInTop2ParallelPathway.docx")
Finally, we would like to save a csv file with all of the significant sub-pathways.
Read in the table data
Structure the model levels
Filter table only on the significant subpathways
Save the file in the “Outputs/Tables” folder under the name “signifiant_subpathways”.
# 1. Read in the table data
table_data <- read.csv("../Outputs/Tables/Table_data.csv")
# 2. Structure the cases
if(sum(grepl("interaction",names(table_data))==0)){
cases <- expression( case_when(model == "Single" ~ single_fisher))
}
if(sum(grepl("interaction",names(table_data)))>0){
cases <- expression( case_when(model == "Interaction" ~ interaction_fisher,
model == "Parallel" ~ parallel_fisher,
model == "Single" ~ single_fisher))
}
# 3. Filter table only on the significant subpathways
sig_data <- table_data %>%
filter(model != "None") %>%
mutate(combined_fisher_p_val = eval(cases)) %>%
select(sub_pathway, model, combined_fisher_p_val) %>%
arrange(sub_pathway)
# 4. Save the file
write.csv(sig_data, file = "../Outputs/Tables/significant_sub_pathways.csv")
In the above analysis, we tested the interaction, parallel, and treatment models for all sub-pathways. In this step, we tested if the sub-pathways were affected by any of the treatment groups. By first testing for all treatment groups, we can preserve the family-wise error rate of 0.05. However, we may be interested in specific pairwise comparisons, and running multiple tests for each comparison of interest increases our chances of making a type 1 error. If we subset the data to only focus on the subpathways with a significant p-value when testing all treatment groups, then we can help preserve the family-wise error rate at 0.05.
1.) Reading in the data used in the analysis above. You will need to make changes to the path to reflect the analysis data you want to use. For example the stratified or non stratified analysis data.
2.) Reading in the results from the analysis above. For example the stratified or non stratified results.
3.) Sub-setting the analysis data to only include the sub-pathways that were found to be significant. This first step is to find the significant sub-pathways. By default we are first going to look at the sub-pathways that had a significant interaction model.
Note: Our code is currently set up to look at the results withing the male stratum from the stratified analyse. You will want to change the file names to reflect the names of the files generated from the results and analysis data above.
# 1. Read in data from the analysis above
analysis_data <- read.csv("../Data/Processed/analysis_data_Male.csv", check.names = F)
# 2. Read in results from analysis above.
path_data <- read.csv("../Data/Results/Stratified_Male_ancova_results.csv", check.names = F)
# 3. Filter data on pathways that showed a significant interaction.
significant_subpathways <- filter(path_data, model == "Interaction", !is.na(interaction_pval))
new_analysis_data <- analysis_data %>%
select(PARENT_SAMPLE_NAME, GROUP_NAME, TIME1, Gender, significant_subpathways$chem_name)
The new analysis data will only include the sub-pathways with a significant interaction model. Now we want to look at all pairwise comparisons of our treatment variable. In the following chuck we are:
1.) Defining all of the meta data variables to include in the ancova_function. These values can stay the same throughout the whole pipeline.
2.) Defining all pairwise groups of the treatment variable. This is done with the get_compairisons function in dplyr. To make changes to this code you will need to change the variable name within this function.
For each pair of values, we will add an additional argument to the ancova function that will subset the data so that the analysis is only focused on the specified comparison. Within the for loop we are:
3.) Runnning the pairwise analysis. Any changes that are made to the arguments of the anocova_function will also need to be made in the for loop.
4.) Saving the pairwise results to Data/Results/Pairwise_Compairisons with the default file name, “PairwiseCompairison_{var1}and{var2} where {var1} and {var2} are the names of the values being compared.
Given the steps above, you will need to change the variable name in step 2 and the variable name inside the for loop in step 3.
# 1. Define all meta data variables to use in the analysis.
meta_analysis_variables <- meta_analysis_variables <- c("PARENT_SAMPLE_NAME",
"GROUP_NAME",
"TIME1",
"Gender")
# 2. Define all pairwise groups of the treatment variable
pairs <- new_analysis_data %>%
get_comparisons("GROUP_NAME") # Change argument as needed
# Run for loop for all of the pairwise compairsons
for (i in 1:length(pairs)){
# 3. Run pairwise ancova_function
pair <- pairs[[i]]
pairwise_path_results <- ancova_function(new_analysis_data,
chem_data = chem_data,
meta_analysis_variables = meta_analysis_variables,
treatment.only = F,
additive_vars=c("GROUP_NAME","TIME1"),
interaction_vars =c("GROUP_NAME*TIME1"),
treat_var = "GROUP_NAME",
GROUP_NAME %in% pair) # Pairwise comparison
# Change variable name
# To a name within the meta data
# 4. Save the pairwise comparison.
write.csv(pairwise_path_results, paste0("../Data/Results/Pairwise_Compairisons/PairwiseCompairison_",
pair[1], "_and_",pair[2],".csv"), row.names = FALSE)
}
In the chunk above, we tested each sub-pathway for several pairwise comparisons. We now want to see which models were significant at the sub-pathway level for each comparison and which models were significant. The following code:
1.) Finds all of the saved pairwise results from the previous step.
2.) Gets the name of each pairwise comparison based on the file name.
3.) Create an initial table for the first comparison showing the number of sub-pathways that were significant for each model type.
4.) Run a for loop for each pairwise comparison which creates the save table and merges the number of sub-pathways that were significant for each model into one table.
5.) Format table and save.
# 1. Find saved pairwise results
pairwise_files <- list.files("../Data/Results/Pairwise_Compairisons/")
# 2. Get the name of the each compairison based on the file name.
pairs <- sub("PairwiseCompairison_","",sub(".csv","",pairwise_files))
# 3. Create initial table for first pairwise compairison
tablePair <- read.csv(paste0("../Data/Results/Pairwise_Compairisons/",pairwise_files[1]), check.names = F) %>%
select(sub_pathway, ends_with("_fisher"), model) %>%
distinct() %>%
count(model, name = pairs[1])
# 4. For loop merging pairwise compairison results.
for(i in 2:length(pairwise_files)){
tablePair <- tablePair %>%
full_join(read.csv(paste0("../Data/Results/Pairwise_Compairisons/",pairwise_files[i]), check.names = F) %>%
select(sub_pathway, ends_with("_fisher"), model) %>%
distinct() %>%
count(model, name= pairs[i]),
by = "model")
}
# 5. Display and save table
tablePair %>%
mutate(model = factor(model, levels = c("Interaction", "Parallel", "Single", "None")),
across(-model, ~ifelse(is.na(.), 0, .))) %>%
arrange((model)) %>%
flextable() %>%
set_header_labels(model="Model Type") %>%
set_table_properties(layout = "autofit") %>%
theme_vanilla()
Model Type | 1902_and_Control | 1902_and_WT | Control_and_WT |
|---|---|---|---|
Interaction | 61 | 11 | 81 |
Parallel | 18 | 19 | 0 |
Single | 0 | 37 | 0 |
None | 2 | 14 | 0 |
tablePair %>%
mutate(model = factor(model, levels = c("Interaction", "Parallel", "Single", "None")),
across(-model, ~ifelse(is.na(.), 0, .))) %>%
arrange((model)) %>%
flextable() %>%
set_header_labels(model="Model Type") %>%
set_table_properties(layout = "autofit") %>%
theme_vanilla() %>%
save_as_docx(path = "../Outputs/Tables/SigPathsByModelType_pairwise.docx")
After running all of the pairwise comparisons and adding them to the Data/Results/Pairwise_Comparisons folder, we may want to look at the Fisher p-value for a specific model type for each pairwise comparison. For example, we may want to look at the interaction Fisher p-value across all comparisons for each sub-pathway.
In the following chunk we use a function called strat1_path_comparisons from the .R file in the R script folder. This function takes the following arguments:
Path: A path to the folder storing the results for all pairwise comparisons.
Model: The column of the results that you would like to compare. For example, we could compare interaction_fisher, parallel_fisher, or single_fisher across all pairwise comparisons.
The result will be a table with each row representing different sub pathways and each column is the fisher p-value for the specified model.
# 1. Provide path
path = "../Data/Results/Pairwise_Compairisons/"
# 2. Create table compairing specified pvalues
strat1_path_comparisons <- sub_path_pairwise_comparison(path = path, interaction_fisher)
Utilizing the table above, we can summarize our results in several ways. The next few chunks will be an example of how we can summarize the results. First, let’s look at the names of the headers in the stratified pathways comparisons table we created above.
# Names of Strat1_path_comparisons.
names(strat1_path_comparisons)[-1]
## [1] "1902_and_Control_interaction_fisher" "1902_and_WT_interaction_fisher"
## [3] "Control_and_WT_interaction_fisher"
Above are the names of the headers in the table comparing all pairwise comparisons for a specific model. In the first table, we are creating a table that only contains sub-pathways that show a significant p-value in the first comparison. To do this, will:
Filter the table so that only sub-pathways that were significant in the first comparison are displayed. To do this you will need to edit the code so that the correct comparison is used.
Select the sub_pathway and the specific comparison column. To do this you will need to edit the code so that the correct comparison is used.
Rename to displayed header name to the desired header.
# Display sig. pathways for Comparison 1
comp1 <-strat1_path_comparisons %>%
# 1. Filter on subpathways on the first comparison.
filter(`1902_and_Control_interaction_fisher` < 0.05) %>%
# 2. Select subpathway and comparison.
select(sub_pathway, `1902_and_Control_interaction_fisher` ) %>%
arrange(sub_pathway) %>%
flextable() %>%
# 3. Change header name to something more readable.
set_header_labels(sub_pathway="Sub Pathway", `1902_and_Control_interaction_fisher` ="Combined Fisher Prob P-val") %>%
colformat_double(digits = 3) %>%
set_table_properties(layout = "autofit") %>%
theme_vanilla()
# Display table
comp1
Sub Pathway | Combined Fisher Prob P-val |
|---|---|
Aminosugar Metabolism | 0.000 |
Ascorbate and Aldarate Metabolism | 0.000 |
Ceramides | 0.000 |
Chemical | 0.000 |
Diacylglycerol | 0.000 |
Dihydrosphingomyelins | 0.020 |
Endocannabinoid | 0.000 |
Fatty Acid Metabolism (Acyl Carnitine, Dicarboxylate) | 0.000 |
Fatty Acid Metabolism (Acyl Carnitine, Hydroxy) | 0.000 |
Fatty Acid Metabolism (Acyl Carnitine, Long Chain Saturated) | 0.000 |
Fatty Acid Metabolism (Acyl Carnitine, Medium Chain) | 0.000 |
Fatty Acid Metabolism (Acyl Carnitine, Monounsaturated) | 0.000 |
Fatty Acid Metabolism (Acyl Carnitine, Polyunsaturated) | 0.000 |
Fatty Acid Metabolism (Acyl Glycine) | 0.000 |
Fatty Acid, Branched | 0.000 |
Fatty Acid, Dicarboxylate | 0.000 |
Fatty Acid, Dihydroxy | 0.000 |
Fatty Acid, Monohydroxy | 0.000 |
Fructose, Mannose and Galactose Metabolism | 0.040 |
Gamma-glutamyl Amino Acid | 0.000 |
Glutamate Metabolism | 0.010 |
Glutathione Metabolism | 0.000 |
Glycerolipid Metabolism | 0.000 |
Glycine, Serine and Threonine Metabolism | 0.000 |
Glycolysis, Gluconeogenesis, and Pyruvate Metabolism | 0.030 |
Glycosyl PE | 0.035 |
Hexosylceramides (HCER) | 0.000 |
Ketone Bodies | 0.011 |
Leucine, Isoleucine and Valine Metabolism | 0.000 |
Long Chain Monounsaturated Fatty Acid | 0.000 |
Long Chain Polyunsaturated Fatty Acid (n3 and n6) | 0.000 |
Long Chain Saturated Fatty Acid | 0.000 |
Lysine Metabolism | 0.010 |
Lysophospholipid | 0.000 |
Lysoplasmalogen | 0.000 |
Medium Chain Fatty Acid | 0.000 |
Methionine, Cysteine, SAM and Taurine Metabolism | 0.010 |
Monoacylglycerol | 0.000 |
Nicotinate and Nicotinamide Metabolism | 0.000 |
Partially Characterized Molecules | 0.000 |
Phosphatidylcholine (PC) | 0.000 |
Phosphatidylethanolamine (PE) | 0.000 |
Phosphatidylinositol (PI) | 0.000 |
Phospholipid Metabolism | 0.000 |
Plasmalogen | 0.000 |
Polyamine Metabolism | 0.000 |
Primary Bile Acid Metabolism | 0.000 |
Purine Metabolism, Adenine containing | 0.000 |
Purine Metabolism, Guanine containing | 0.000 |
Pyrimidine Metabolism, Cytidine containing | 0.000 |
Pyrimidine Metabolism, Uracil containing | 0.000 |
Riboflavin Metabolism | 0.020 |
Secondary Bile Acid Metabolism | 0.000 |
Sphingomyelins | 0.000 |
Sphingosines | 0.000 |
Sterol | 0.000 |
Tocopherol Metabolism | 0.020 |
Tryptophan Metabolism | 0.000 |
Unknown Metabolite | 0.000 |
Vitamin A Metabolism | 0.000 |
Vitamin B6 Metabolism | 0.040 |
# Save table
comp1 %>%
save_as_docx(path = "../Outputs/Tables/SigPathwaysPairwiseComparison1.docx")
Now that we know how many sub-pathways are significant for the first comparison, we might want to check which sub-pathways were not significant in the first sub-pathway but were significant in the second comparison. Similar to the table above, we are going to:
Filter the table so that only sub-pathways that were not significant in the first comparison and significant in the second comparison are displayed. To do this you will need to edit the code so that the correct comparisons are used.
Select the sub_pathway and the specific comparison column.To do this you will need to edit the code so that the correct comparison is used.
Rename to displayed header name to the desired label.
# Table displays non-significant comparison 1 and significant comparison 2
comp2 <- strat1_path_comparisons %>%
# 1. Filter on subpathways on the first comparison.
filter(`1902_and_Control_interaction_fisher` >= 0.05, `1902_and_WT_interaction_fisher` < 0.05) %>%
# 2. Select subpathway and comparison.
select(sub_pathway, `1902_and_WT_interaction_fisher` ) %>%
arrange(sub_pathway) %>%
flextable() %>%
# 3. Change header name to something more readable.
set_header_labels(sub_pathway="Sub Pathway", `1902_and_WT_interaction_fisher` ="Combined Fisher Prob P-val") %>%
colformat_double(digits = 3) %>%
set_table_properties(layout = "autofit") %>%
theme_vanilla()
# Display table
comp2
Sub Pathway | Combined Fisher Prob P-val |
|---|---|
Creatine Metabolism | 0.020 |
Disaccharides and Oligosaccharides | 0.024 |
Fatty Acid, Amino | 0.010 |
Food Component/Plant | 0.000 |
Pentose Metabolism | 0.020 |
Pentose Phosphate Pathway | 0.000 |
Purine Metabolism, (Hypo)Xanthine/Inosine containing | 0.000 |
Pyrimidine Metabolism, Orotate containing | 0.020 |
# Save table
comp2 %>%
save_as_docx(path = "../Outputs/Tables/SigPathwaysPairwiseComparison2.docx" )
Now that we know how many sub-pathways are significant for the first two comparisons, we might want to check which sub-pathways were significant in the third sub-pathway but not the first two. This pattern can go on for as many comparisons as necessary. Additionally, we can augment the “filter” arguments to display the table as desired. Similar to the table above, we are going to:
Filter the table so that only sub-pathways that were not significant in the first comparison and significant in the second comparison are displayed. To do this you will need to edit the code so that the correct comparisons are used.
Select the sub_pathway and the specific comparison column.To do this you will need to edit the code so that the correct comparison is used.
Rename to displayed header name to the desired label.
# Table displays non-significant comparison 1 and 2 and significant comparison 3
comp3 <- strat1_path_comparisons %>%
# 1. Filter on subpathways on the first comparison.
filter(`1902_and_Control_interaction_fisher` >= 0.05, `1902_and_WT_interaction_fisher` >= 0.05,
Control_and_WT_interaction_fisher < 0.05) %>%
# 2. Select subpathway and comparison
select(sub_pathway, Control_and_WT_interaction_fisher) %>%
arrange(sub_pathway) %>%
flextable() %>%
# 3. Change header name to something more readable.
set_header_labels(sub_pathway="Sub Pathway", Control_and_WT_interaction_fisher="Combined Fisher Prob P-val") %>%
colformat_double(digits = 3) %>%
set_table_properties(layout = "autofit") %>%
theme_vanilla()
# Display table
comp3
Sub Pathway | Combined Fisher Prob P-val |
|---|---|
Alanine and Aspartate Metabolism | 0.000 |
Dipeptide | 0.000 |
Fatty Acid Metabolism (also BCAA Metabolism) | 0.010 |
Histidine Metabolism | 0.000 |
Lactoyl Amino Acid | 0.000 |
Mevalonate Metabolism | 0.020 |
Phenylalanine Metabolism | 0.000 |
Sphingolipid Synthesis | 0.000 |
TCA Cycle | 0.000 |
Thiamine Metabolism | 0.000 |
Tyrosine Metabolism | 0.000 |
Urea cycle; Arginine and Proline Metabolism | 0.000 |
# Save table
comp3 %>%
save_as_docx(path = "../Outputs/Tables/SigPathwaysPairwiseComparison3.docx")
We may want to look at specific sub-pathways for each comparison and all models. In the chunk below we use the sub_path_pairwise function defined in the R scripts folder. This function takes two arguments.
path: The path the pairwise comparison results folder.
… This is a masking parameter that allows you to filter the sub_pathway to the specific subpathway of interest.
In the example below we:
Provide a path to the Pairwise comparison results.
Create the table with all fisher p-values for the selected subpathways.
Displays the table
Saves table in ../Outputs/Tables/ with the file name SubpathwaysFisherProbs_allModels.
# 1. Establish path to pairwise results
path = "../Data/Results/Pairwise_Compairisons/"
# 2. Create the table with all fisher p-values for the selected subpathways.
tab_all <- sub_path_pairwise_comparison_all(path =path, sub_pathway %in% c("Vitamin B6 Metabolism", "Creatine Metabolism", "Pentose Metabolism"))
#2. Create the table with all fisher p-values for the selected subpathways
tab_all
Fisher Combined Probabilities | ||||
|---|---|---|---|---|
Pairwise Model | Sub Pathway | Interaction | Parallel | Single |
1902_and_Control | Creatine Metabolism | 0.53 | 0.00 | 0.83 |
1902_and_Control | Pentose Metabolism | 0.16 | 0.00 | 1.00 |
1902_and_Control | Vitamin B6 Metabolism | 0.04 | 0.00 | 0.73 |
1902_and_WT | Creatine Metabolism | 0.02 | 0.36 | 0.06 |
1902_and_WT | Pentose Metabolism | 0.02 | 0.10 | 0.80 |
1902_and_WT | Vitamin B6 Metabolism | 0.37 | 0.27 | 0.39 |
Control_and_WT | Creatine Metabolism | 0.00 | 0.32 | 0.36 |
Control_and_WT | Pentose Metabolism | 0.00 | 0.02 | 0.98 |
Control_and_WT | Vitamin B6 Metabolism | 0.00 | 0.62 | 0.90 |
# 4. Saves table in ../Outputs/Tables/ with the file name SubpathwaysFisherProbs_allModels.
tab_all %>%
save_as_docx(path = "../Outputs/Tables/SubpathwaysFisherProbs_allModels.docx")
Visualizations of the data can help us see the underlying trends. One useful visualization is boxplots and line graphs.
Suppose there is a specific subpathway that we would like to visualize. In that case, we can compare the treatment groups for each metabolite within that sub-pathway using the subpathway_box plots function defined in the R script. T his function takes the following arguments.
Analysis data (analysis_data): This is the analysis data used for the modeling in the primary analysis. If you followed this pipeline all the way through, the analysis data is saved in the Data/Processed folder.
Chemical annotation data (chem_data): The chemical annotation data. This argument is necessary to link the columns of the analysis data to the sub pathways. If you followed this pipeline all the way through, the analysis data is saved in the Data/Processed folder.
Pathway (pathway): This is the name of the sub-pathway of interest.
X: This the the name of the variable in the meta data that is used for the X axis of the box plots.
Group By (groupBy): This is a grouping variable. As a recommendation the treatment groups should be used in the groupBy argument as this will provide a different color for each of the treatments making it easier to identify.
… : At the end of this function you can provide additional arguments for filtering the analysis data. In the examples below we are filtering so the box plots only focus on males and only two treatment groups.
The output of the subpathway_boxplot function is box plots for each of the metabolites within the specified subpathway. The labels for the box plots are default based on the data created inside the subpathway_boxplots function. Since these box plots utilize ggplot, we can customize all labels using the. labs function from ggplot.
Here we look at the boxplots for each metabolite within a specified subpathway utilizing the subpathway_boxplots function. In the following chunk we:
Read in the analysis data
Read in the chemical annotation data
Specify the sub-pathway of interest
Run the subpathway_boxplots function
Customize labels for the generated box plot.
Save the final boxplots to the Outputs/Figures folder with the name “Metabolite_boxplots_{{subpathway}}” where {{subpathway}} is the name of the specified sub-pathway.
# 1. Read in analysis data
analysis_data <- read.csv("../Data/Processed/analysis_data_Male.csv", check.names = F)
# 2. Reach in chemical annotation data
chem_data <- read.csv("../Data/Processed/chem_data.csv", check.names = F)
# 3. Specify subpathway of interest.
subpath = "Gamma-glutamyl Amino Acid"
# 4. Run the subpathway_boxplots function
box_1 <- subpathway_boxplots(analysis_data = analysis_data,
chem_data = chem_data,
subpathway = subpath,
X = TIME1,
groupBy = GROUP_NAME,
# Additional analysis data filtering options.
Gender == "Male", GROUP_NAME != "Control")
# 5. Customize labels for the boxplots
box_1 +
labs(x = "Time", y="Scaled Intensity", color="Treatment Group",
title = "Metabolite boxplots for the Gamma-glutamyl Amino Acid Subpathway")
# 6. Save boxplots. You can change the file type for example change pdf to png.
ggsave(filename = paste0("../Outputs/Plots/Metabolite_boxplots_",subpath,".pdf"))
## Saving 10 x 10 in image
Another useful visualization is line plots. These plots are useful for seeing trends between an ordered variable and the scaled intensity. Line plots can be useful for analyzing trends across the different treatment groups. Like the box plots, we are interested in seeing trends for all metabolites within a sub-pathway. To do this, we can utilize the “subpathway_lineplots” function defined in the R scripts. This function takes the following arguments.
Analysis data (analysis_data): This is the analysis data used for the modeling in the primary analysis. If you followed this pipeline all the way through, the analysis data is saved in the Data/Processed folder.
Chemical annotation data (chem_data): The chemical annotation data. This argument is necessary to link the columns of the analysis data to the subpathways. If you followed this pipeline all the way through, the analysis data is saved in the Data/Processed folder.
Pathway (pathway): This is the name of the subpathway of interest.
X: This the name of the variable in the metadata that is used for the X axis of the line plots.
Group By (groupBy): This is a grouping variable. As a recommendation the treatment groups should be used in the groupBy argument as this will provide a different color for each of the treatments making it easier to identify.
Below is a demonstration of utilizing the line plots to see trends between treatment groups at the metabolite level. In the chunk below we:
Read in the analysis data
Read in the chemical annotation data
Prep data for line plots. To produce the lines we must provide a numeric variable for the x-axis. If you are looking at trends across an ordinal categorical variable you will need to convert the variable to be numeric. For example, if we want to see the trend in time between treatment groups, where time takes the ordinal values of “Pre-symptomatic”, “onset of symptoms”, and “end of symptoms”, then we will need to assign these names to the values of 1,2, and 3 respectively. Additionally, we will need to make any other desired adjustments to our data at this step.
Define the sub-pathway of interest.
Run the “subpathway_lineplots function”
Edit the aesthetics of plots. Similar to the box plots, we are using the labs function to edit the labels for the plots. Additionally, we are utilizing the scale_x_continuous function at add labels to the tick marks for the plot that correspond with the original values.
Save the plots in the folder “Outputs/Plots” under the file name “linePlots_{{subpathway}} where {{subpathway}} is the name of the subpathway defined in step 4.
# 1. Read in analysis data
analysis_data <- read.csv("../Data/Processed/analysis_data.csv", check.names = F)
# 2. Reach in chemical annotation data
chem_data <- read.csv("../Data/Processed/chem_data.csv", check.names = F)
# 3. Prep data for line plots
analysis_data <- analysis_data %>%
filter(Gender=="Male") %>%
mutate(TIME1 = case_when(TIME1== "PreSymp" ~ 1,
TIME1 =="Onset" ~ 2,
TIME1 == "End" ~ 3))
# 4. Define subpathway of interest.
subpathway = "Glycerolipid Metabolism"
# 5. Run subpathway_line plots function.
line_plots <- subpathway_lineplots(analysis_data = analysis_data,
chem_data = chem_data,
subpathway = subpathway,
X = TIME1,
groupBy = GROUP_NAME)
# 6. Edit asthetics of plots
line_plots +
scale_x_continuous(breaks = c(1, 2, 3), labels = c("PreSymp", "Onset", "End")) +
labs(x= "Time",y="Scaled Intensity",color = "Treatment Group")
## `geom_smooth()` using formula = 'y ~ x'
# 7. Save plots
ggsave(filename = paste0("../Outputs/Plots/LinePlots_",subpathway,".pdf"))
## Saving 12 x 12 in image
## `geom_smooth()` using formula = 'y ~ x'
We may also want to stratify the line plots by a specific condition. In the example below we stratify the line plots by gender in 10 steps.
Read in the analysis data
Read in the chemical annotation data
Define the subpathway of interest.
Prep data for line plots and filter on specific conditions in the data. In this example we are filtering the data on males. Additionally, the same processing steps from above need to be completed.
Filter analysis data on second condition following the same instructions from step 3. In this example we are filtering the data on females.
Run the subpathway_lineplots function for males
Run the subpathway_lineplots function for females
Use the ggarrange function to combine these two plots into one.
Save the plot in Outputs/Plots folder with the LinePlots_{{subpathway}}_stratified.
# 1. Read in analysis data
analysis_data <- read.csv("../Data/Processed/analysis_data.csv", check.names = F)
# 2. Reach in chemical annotation data
chem_data <- read.csv("../Data/Processed/chem_data.csv", check.names = F)
# 3. Define subpathway of interest.
subpathway = "Glycerolipid Metabolism"
# 4. Prep data for line plots
analysis_data_male <- analysis_data %>%
filter(Gender=="Male") %>%
mutate(TIME1 = case_when(TIME1== "PreSymp" ~ 1,
TIME1 =="Onset" ~ 2,
TIME1 == "End" ~ 3))
# 5. Prep data for line plots
analysis_data_female <- analysis_data %>%
filter(Gender=="Female") %>%
mutate(TIME1 = case_when(TIME1== "PreSymp" ~ 1,
TIME1 =="Onset" ~ 2,
TIME1 == "End" ~ 3))
# 6. Create plot for males
males <- subpathway_lineplots(analysis_data = analysis_data_male,
chem_data = chem_data,
subpathway = subpathway,
X = TIME1,
groupBy = GROUP_NAME)+
scale_x_continuous(breaks = c(1, 2, 3), labels = c("PreSymp", "Onset", "End")) +
labs(x= "Time",y="Scaled Intensity",color = "Treatment Group")
# 7. Create plot for females
females <- subpathway_lineplots(analysis_data = analysis_data_male,
chem_data = chem_data,
subpathway = subpathway,
X = TIME1,
groupBy = GROUP_NAME)+
scale_x_continuous(breaks = c(1, 2, 3), labels = c("PreSymp", "Onset", "End")) +
labs(x= "Time",y="Scaled Intensity",color = "Treatment Group")
# 8. Display stratified plot
ggarrange(males,
females,
labels = c("Males", "Females"),
nrow = 2,
common.legend = T,
vjust = 0
)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# 9. Save plots
ggsave(filename = paste0("../Outputs/Plots/LinePlots_",subpathway,"_stratified.pdf"))
## Saving 12 x 12 in image